Modern Applied Statistical Work

Susan VanderPlas

Outline

Applied Statistics “In Theory”

  • Design the experiment
  • Collect the data
    • Neat, perfectly clean data
  • Analyze the data
  • Write the report
  • Act on the conclusions

That probably never happens

Applied Statistics in Practice

  • Messy data
  • Vague expectations
  • Fundamental misunderstandings
  • The curse of “Big Data”

Messy Data

What I expected

  • Poorly coded variables
  • Misspellings
  • Lack of documentation
  • Appalling Excel spreadsheets

What I got

Big Data Problems

Big Data Problems

  • “Needle in a Haystack”
    Finding one interesting thing in 100+ GB of data

  • “Needle in a stack of needles”
    100 interesting things - how to investigate them all?


Limited bandwidth

Big Data

Visualization is an important tool for working with big data

Adaptations must be made:

  • Overplotting (large \(n\))
  • High-dimensional data (large \(p\))
  • Distributed/multi-source data, hierarchical data
  • No solution (binning, dimension reduction, tours) works for every situation
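As a rough illustration of binning, one of the overplotting adaptations named above, here is a minimal Python sketch. The `bin_2d` helper and the sample points are hypothetical and not from the talk. It reduces a point cloud to counts per grid cell, so a plot needs at most one mark per occupied cell instead of one per observation.

```python
from collections import Counter

def bin_2d(points, cell_size):
    """Count points per square grid cell (hypothetical helper).

    Returns a Counter mapping (column, row) cell indices to counts.
    """
    counts = Counter()
    for x, y in points:
        # Floor division assigns each point to the cell that contains it.
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts

# Illustrative data only; a real data set would have millions of points.
points = [(0.1, 0.2), (0.4, 0.9), (1.5, 0.3), (1.7, 0.6), (1.9, 0.1), (2.2, 2.8)]
binned = bin_2d(points, cell_size=1.0)

for cell, n in sorted(binned.items()):
    print(cell, n)
```

The cell size controls the trade-off: larger cells mean fewer marks to draw but less detail, which is one reason no single setting suits every situation.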

Interactive Graphics

  • Provide additional information in response to user action

  • Simultaneously show more than 2-3 variables and their relationship (multiple linked plots)

  • Accommodate complex data structures

BUT…


Web-based interactive graphics may be even more size-sensitive than static graphics.

Interactive Visualization of Soybean Population Genetic Data

Soybean Project: People and Institutions

Overall Project Goals:

  • Understand historical yield increases
    Yields rose 100% over the past 100 years; an additional 70% increase is needed by 2050 to meet food needs (World Bank)
  • Associate genetic features with phenotypic traits
    Disease resistance, yield, nutritional content, time to maturity

  • Communicate analysis results intuitively:
    • Target: Soybean farmers, plant geneticists
    • Provide full results (tables) and graphical summaries
    • Interface with existing databases and web resources

Data


  • Sequencing Data
    (79 varieties, 75 GB processed and compressed)

  • Field Trials
    (168 varieties, 30 varieties with genetic data)

  • New crosses with highest yield varieties
    (sequencing + field trials)

  • Genealogy as reported in the breeding literature (1600 varieties)

Visualizing SNPs

  • SNP: Single Nucleotide Polymorphism, a single base-pair mutation
    (A -> T, G -> A, C -> G)
  • Shiny applet: Responsive applet for user-directed data subsets
  • Show multiple levels of detail (less detail = lower computational load)
  • Provide resources in the applet for user exploration (not just a reference tool)
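To make the SNP definition above concrete, here is a minimal Python sketch that lists single-base differences between two sequences. The `find_snps` helper and both sequences are made up for illustration. It assumes the sequences are already aligned and of equal length; real SNP calling from sequencing reads is considerably more involved.

```python
def find_snps(reference, sample):
    """Return (position, ref_base, alt_base) for each single-base mismatch.

    Hypothetical helper: assumes both sequences are already aligned
    and have the same length. Positions are 0-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]

# Illustrative sequences only, not soybean data.
reference = "ATGCCGTA"
sample = "TTGCAGTA"
snps = find_snps(reference, sample)

for pos, ref, alt in snps:
    print(f"position {pos}: {ref} -> {alt}")
```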

Visualizing SNPs:

  • Huge number of interesting genes (70 million ID’d SNPs)
  • 79 varieties, 20 chromosomes
  • Phenotype and genealogy information
  • Researchers tend to work on gene subsets:
    Must be able to zoom and filter
  • Optimized files for SNP results are still large (10 GB) and require significant computational resources

Above all, we need an interface that lets people pull new discoveries from the data systematically.
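The zoom-and-filter requirement, together with the idea of showing less detail to lower the computational load, might be sketched as follows in Python. The record layout, the variety names, and the `filter_region` and `snp_density` helpers are all assumptions for illustration; this is not the applet's actual implementation.

```python
from collections import Counter

# Hypothetical records: (chromosome, position in base pairs, variety).
snp_records = [
    ("Chr01", 1_200, "VarietyA"),
    ("Chr01", 1_800, "VarietyB"),
    ("Chr01", 5_400, "VarietyA"),
    ("Chr02", 300, "VarietyA"),
    ("Chr02", 9_700, "VarietyB"),
]

def filter_region(records, chromosome, start, end):
    """Zoom: keep records on one chromosome with start <= position < end."""
    return [r for r in records if r[0] == chromosome and start <= r[1] < end]

def snp_density(records, window):
    """Low-detail view: count SNPs per (chromosome, window index)."""
    return Counter((chrom, pos // window) for chrom, pos, _ in records)

zoomed = filter_region(snp_records, "Chr01", 1_000, 2_000)
density = snp_density(snp_records, window=5_000)

print(zoomed)
print(sorted(density.items()))
```

Serving windowed counts at the chromosome level and individual records only for a zoomed region is one way to keep the amount of data sent to the browser small.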

Applet Design

SNP Population Distribution

SNP Applet Overview

Density of SNPs: Chromosome Level

SNP Density

Individual SNPs: Comparing Varieties

Variety-Level SNP Browser

Genealogy and Phenotypes

SNP Linked Plots

Interactive Plot Design

Good Statistical Graphics

Function:

  • Show the data
  • Don’t distort the data

Form:

  • Show a consistent story
  • Provide several levels of detail (ideally)

Elegance:
How do I best communicate the data?

  • Perceptual Awareness
  • Visual Bandwidth (information overload)